and, hence, that appears as conserved regions in a group of evolutionarily related

gene sequences. This is not a strong definition, not least because the motif concept

is based on a mosaic view of the genome that is opposed to the more realistic (but

less tractable) systems view.

The construction of the concise descriptions could be either deductive or inductive.

A difficulty is that extant natural genomes are not elegantly designed from scratch,

but assembled ad hoc, and refined by “life experience” (of the species). The use of

fuzzy criteria may help to overcome this problem.

In practice, intrinsic methods often boil down to either computing one or more

parameters from the sequence and comparing them with the same parameters computed

for sequences of known function, or searching for short sequences that experience

has shown are characteristic of certain functions.
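The first of these two approaches can be sketched as follows. This is only an illustration under stated assumptions: GC content is chosen here as the computed parameter (the text does not prescribe any particular one), and the function names, class labels, and reference values are hypothetical placeholders, not measured data.

```python
def gc_content(seq):
    """Return the fraction of G or C bases in a DNA sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        raise ValueError("empty sequence")
    return sum(base in "GC" for base in seq) / len(seq)


def nearest_class(seq, references):
    """Return the label whose reference parameter value is closest to that of seq.

    `references` maps a label to the parameter value computed beforehand
    for sequences of known function.
    """
    value = gc_content(seq)
    return min(references, key=lambda label: abs(references[label] - value))


# Hypothetical reference values, for illustration only (not measured data).
REFERENCES = {"class_a": 0.35, "class_b": 0.60}

if __name__ == "__main__":
    print(nearest_class("ATGCGC", REFERENCES))
```

In practice several parameters would probably be combined, and the comparison would need a statistical criterion rather than a simple nearest-value rule.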

17.5.1 Signals

In the context of intrinsic methods for assigning a function to DNA, the term “signal”

denotes a short sequence relevant to the interaction of the gene expression machinery

with the DNA. In effect, one is paralleling the action of the cell (e.g., the transcription,

splicing, and translation operations) by trying to recognize where the gene expression

machinery interacts with DNA. In a sense, therefore, this topic belongs equally well to

interactomics (Chap. 23). Much use has been made of so-called consensus sequences,

which are formed from sequences well conserved over many species by taking the

most common base at each position. The distance (e.g., the Hamming distance) of

an unknown sequence from the consensus sequence is then computed; the closer

they are, the more likely it is that the unknown sequence has the same function as

that represented by the consensus sequence. Useful signals include start and stop

codons (Table 7.1). More sophisticated signals include sequences predicted to result

in unusual DNA bendability, sequences known to be involved in positioning DNA around

histones, intron splice sites in eukaryotic pre-mRNA, and sequences corresponding

to ribosome binding sites on RNA.
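The consensus-sequence procedure described above can be sketched in a few lines. The sketch assumes that the conserved sequences have already been aligned and have equal length; the function names and the example sequences are hypothetical and chosen only for illustration.

```python
from collections import Counter


def consensus(aligned):
    """Build a consensus by taking the most common base at each position.

    Assumes the input sequences are already aligned and of equal length.
    """
    if not aligned:
        raise ValueError("no sequences given")
    length = len(aligned[0])
    if any(len(s) != length for s in aligned):
        raise ValueError("sequences must have equal length")
    return "".join(
        Counter(column).most_common(1)[0][0] for column in zip(*aligned)
    )


def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def scan(sequence, motif, max_distance):
    """Return (position, distance) for every window within max_distance of motif."""
    width = len(motif)
    hits = []
    for i in range(len(sequence) - width + 1):
        d = hamming(sequence[i:i + width], motif)
        if d <= max_distance:
            hits.append((i, d))
    return hits


if __name__ == "__main__":
    # Hypothetical aligned examples, for illustration only.
    examples = ["TATAAT", "TATGAT", "TACAAT", "TATAAT"]
    motif = consensus(examples)
    print(motif, scan("GGTATGATCC", motif, max_distance=1))
```

Taking only the most common base discards information about how strongly each position is conserved; the Hamming distance then weights every position equally, which is a simplification.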

Special effort has been devoted to identifying promoters, which are of great interest

as potential targets for new drugs. It is a hard problem because of the large and variable

distances between the promoter(s) and the sequence to be transcribed. The approach

relies on relatively well-conserved sequences (i.e., effectively consensus sequences)

such as TATA or CCAAT. Other sites for protein–DNA interactions can be examined

in the same way; indeed, the entire transcription factor binding site can be included

in the prototype object, which allows more sophistication (e.g., some constraints

between the sequences of the different parts) to be applied.
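A constraint between the parts of a composite site might be expressed, for example, as an allowed spacing between two conserved elements. The sketch below looks for exact occurrences of CCAAT followed by TATA within a given gap. The function names, the exact-match criterion, the single-strand search, and the default gap limits are all assumptions made for illustration; they are not values taken from the text.

```python
def find_all(sequence, motif):
    """Return start indices of every (possibly overlapping) exact occurrence."""
    hits = []
    start = sequence.find(motif)
    while start != -1:
        hits.append(start)
        start = sequence.find(motif, start + 1)
    return hits


def paired_sites(sequence, upstream="CCAAT", downstream="TATA",
                 min_gap=10, max_gap=60):
    """Return (i, j) pairs: upstream motif at i, downstream motif at j.

    The gap is the number of bases between the end of the upstream motif
    and the start of the downstream motif. The default gap limits are
    hypothetical placeholders, not biological constants.
    """
    downstream_hits = find_all(sequence, downstream)
    pairs = []
    for i in find_all(sequence, upstream):
        end = i + len(upstream)
        for j in downstream_hits:
            if min_gap <= j - end <= max_gap:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    # Hypothetical test sequence, for illustration only.
    seq = "GG" + "CCAAT" + "G" * 12 + "TATA" + "GG"
    print(paired_sites(seq, min_gap=10, max_gap=20))
```

A more realistic search would tolerate mismatches in each element (e.g., via a distance threshold) and would also examine the complementary strand.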